
Conversation

@yf225 (Contributor) commented Aug 18, 2025

As discussed in https://fb.workplace.com/groups/257735836456307/posts/967097558853461/?comment_id=971504168412800&reply_comment_id=971545435075340, the normal triton.testing.do_bench measurement includes torch.compile Dynamo invocation overhead and doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

This PR adds an option to use the Inductor benchmarker as the timing measurement tool, which uses CUDA events for timing and thus more accurately measures only the CUDA kernel runtime.

Usage: --latency-measure-mode inductor_benchmarker
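
For context, here is a minimal sketch of how the two measurements could be compared on the same compiled callable. The `torch._inductor.runtime.benchmarking` import path and the toy workload are assumptions for illustration, not part of this PR:

```python
import torch
import triton.testing

# Assumed import path for the Inductor benchmarker in recent PyTorch;
# it may differ across versions.
from torch._inductor.runtime.benchmarking import benchmarker

x = torch.randn(4096, 4096, device="cuda")
compiled_relu = torch.compile(torch.relu)
compiled_relu(x)  # warm up so compilation itself is not measured

# Per the PR description, this path also pays the torch.compile Dynamo
# invocation overhead on every call.
do_bench_ms = triton.testing.do_bench(lambda: compiled_relu(x))

# Inductor benchmarker: CUDA-event-based timing of the same callable.
benchmark_gpu_ms = benchmarker.benchmark_gpu(lambda: compiled_relu(x))

print(f"do_bench: {do_bench_ms:.4f} ms, benchmark_gpu: {benchmark_gpu_ms:.4f} ms")
```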

@FindHao (Member) commented Aug 18, 2025

Besides that, I recall @xuzhao9 mentioned somewhere that we can use Kineto's trace to sum only GPU time.

@FindHao (Member) commented Aug 18, 2025

I'm not sure about this change. Will this benchmark function change the results for pure handwritten Triton kernels? If it matches the results of triton.testing.do_bench for other Triton kernels, I feel we should just use this new function. If not, we need to understand what the differences are and see how to fix them.

Adding a new latency_measure_mode doesn't sound good because the results will diverge.

what do you think? @xuzhao9

@FindHao (Member) commented Aug 18, 2025

> which uses CUDA events for timing and thus more accurately measures only the CUDA kernel runtime.

https://github.com/triton-lang/triton/blob/37f265932b68868021b2fade6354b44e613dc124/python/triton/testing.py#L163
do_bench also uses CUDA events for timing. I may need to double-check the implementation of benchmarker.benchmark_gpu.

@BoyuanFeng commented Aug 18, 2025

@FindHao benchmarking.benchmark_gpu seems to be better for benchmarks than triton.testing.do_bench. Please check this script for comparison: P1908752761


> use Kineto's trace to sum only GPU time.

This sounds good! It would be great if we had a PyTorch utility to benchmark only GPU time. We could converge all kernel benchmarks to use it if possible.

@FindHao (Member) commented Aug 18, 2025

> benchmarking.benchmark_gpu seems to be better for benchmarks than triton.testing.do_bench.

Can you clarify what the differences are between these two implementations?

What I mean is: if you claim that the Dynamo overhead is completely hidden in real runs, I am willing to trust that. However, we still need to verify that the results from benchmark_gpu are consistent with those from do_bench on other handwritten Triton kernels.

Alternatively, measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead.

@yf225 (Contributor, Author) commented Aug 18, 2025

> measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead

Would we be open to having a mode that measures only GPU time? I believe that for real-world model use cases, the kernel launch or Python overhead is usually hidden by the previous kernel (assuming no bubbles), so having a mode that only measures GPU time is useful and reflects that scenario too.

@FindHao (Member) commented Aug 18, 2025

> measuring only the pure GPU time may be misleading, since it could overlook real Python or Triton runtime overhead
>
> Would we be open to having a mode that measures only GPU time? I believe that for real-world model use cases, the kernel launch or Python overhead is usually hidden by the previous kernel (assuming no bubbles), so having a mode that only measures GPU time is useful and reflects that scenario too.

Try --metrics cuda_time.

> `cuda_time`: Sum of all GPU-side kernel time of an operator backend, measured by Kineto and the PyTorch Profiler.

Is this what you need?
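
For reference, a rough sketch of the Kineto-based idea mentioned above (summing only device-side kernel time from a PyTorch profiler trace). This illustrates the general technique, not TritonBench's actual `cuda_time` implementation, and the profiler attribute name may vary across PyTorch versions:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def gpu_time_per_iter_ms(fn, warmup=5, rep=20):
    """Sum device-side kernel time from a profiler (Kineto) trace."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(rep):
            fn()
        torch.cuda.synchronize()
    # self_cuda_time_total is reported in microseconds; newer PyTorch
    # versions may expose it as self_device_time_total instead.
    total_us = sum(evt.self_cuda_time_total for evt in prof.key_averages())
    return total_us / rep / 1000.0
```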

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 19, 2025
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair for all backends. E.g. it includes torch.compile Dynamo invocation overhead and hence doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

I also opened a PR to use this timing measurement function on the TritonBench side: meta-pytorch/tritonbench#333. But regardless of whether that PR can land, I think we should enhance Inductor benchmark_gpu to match do_bench features, to make it easier for people to migrate.

Pull Request resolved: #160921
Approved by: https://github.com/BoyuanFeng
@xuzhao9 (Contributor) commented Aug 20, 2025

Due to the limitations of Triton's do_bench/do_bench_cudagraph, it is beneficial to have the Inductor benchmarker and more latency measurement modes (e.g. power-limit-aware benchmarking) in the future.

x.grad = None

# Measure only the function execution time
ms_time = benchmarker.benchmark_gpu(fn)
Review comment (Contributor):

I am wondering, does the Inductor benchmarker use CUDA graphs?

@yf225 (Contributor, Author) replied:

I believe it doesn't use CUDA graphs but uses CUDA event timing (e.g. start_event.elapsed_time(end_event)).
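
For illustration, the CUDA-event pattern being referred to looks roughly like this. It is a sketch of the general technique, not the Inductor benchmarker's actual implementation:

```python
import torch

def cuda_event_time_ms(fn, iters=100):
    # Warm up and drain pending work before timing.
    fn()
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(iters):
        fn()
    end_event.record()
    torch.cuda.synchronize()
    # elapsed_time returns the milliseconds between the two recorded events.
    return start_event.elapsed_time(end_event) / iters
```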

can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
@yf225 force-pushed the inductor_benchmarker branch from 37509ff to 1a758b2 on September 2, 2025 19:52
@yf225 merged commit 206b93c into main on Sep 2, 2025 (6 checks passed).
@yf225 deleted the inductor_benchmarker branch on September 2, 2025 20:28.